{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "51194982-9bd9-4b79-85a3-16cbdf36d63b",
   "metadata": {},
   "source": [
    "# Agent Evaluation\n",
    "\n",
    "In this section, we will explore the evaluation of agentic systems. Agentic systems are complex constructs consisting of multiple sub-components. In Lab 3, we examined a simple singleton agent orchestrating between two tools. Lab 4 demonstrated a more sophisticated multi-layer agentic system with a top-level router agent coordinating multiple agentic sub-systems.\n",
    "\n",
    "One direct implication of the interdependent and potentially nested nature of agentic systems is that evaluation can occur at both macro and micro levels. This means that either the entire system as a whole is being evaluated (macro view) or each individual sub-component is assessed (micro view). For nested systems, this applies to every level of abstraction.\n",
    "\n",
    "Typically, evaluation begins at the macro level. In many cases, positive evaluation results at the macro level indicate sufficient agent performance. If macro-level performance evaluation proves insufficient or yields poor results, micro-level evaluation can help decompose the performance metrics and attribute results to specific sub-components.\n",
    "\n",
    "In this lab, we will first explore macro-level agent performance evaluation. Then, we will evaluate tool usage as a lower-level task. To maintain focus on the evaluation process, we will keep the agentic system simple by utilizing the singleton agent composed in Lab 3."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "64169599-95e0-4b76-a81e-c68d479eb79a",
   "metadata": {},
   "source": [
    "## Agent Setup\n",
    "\n",
    "As covered in the previous section, we will reuse the Agent we built in Lab 3. This agent has access to tools designed to help find vacation destinations. You'll be able to interact with the agent by asking questions, observe it utilizing various tools, and engage in meaningful conversations.\n",
    "\n",
    "Let's begin by installing the required packages."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a910fdd3-006b-42b1-88f3-9fee1bc80707",
   "metadata": {},
   "source": [
    "### Util functions part 1 - importing singleton agent\n",
    "\n",
    "To maintain a clean and focused approach in this notebook, we have moved the agent creation logic to a module in `utils.py`. The `create_agent` function replicates the agent creation process of the simple ReAct agent we developed in Lab 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "783694df-e31d-448a-af7b-0c3fc6747c27",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:49:43.995354Z",
     "start_time": "2025-02-14T16:49:43.630869Z"
    }
   },
   "outputs": [],
   "source": [
    "from IPython.core.display import HTML\n",
    "from utils import create_agent\n",
    "from IPython.display import display_markdown, Markdown\n",
    "\n",
    "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")\n",
    "agent_executor = create_agent()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3776a1d-7566-4652-acd2-7eed21f83d59",
   "metadata": {},
   "source": [
    "The ```create_agent``` function returns a ```CompiledStateGraph``` object that represents the Agent from Lab 3's scenario. \n",
    "Now, let's proceed to visualize this graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f21d0ae3-13ca-4bdc-bcc8-087eece6b610",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:49:47.550288Z",
     "start_time": "2025-02-14T16:49:47.236186Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAANYAAAD5CAIAAADUe1yaAAAAAXNSR0IArs4c6QAAIABJREFUeJztnXdcU1fDx8/NXgRI2ES2LKWiAg5wr8f5AFqtaNVWW7WOp3W0tbWt2uqjdmmntlr33uKDggqiWHFVqgytbBnBQCAhITv3/SO+lGJA1NycG3K+H/+IGef8Al/OvffcMzAcxwECAQ8K7AAIewcpiIAMUhABGaQgAjJIQQRkkIIIyNBgB3gR5FKdvE7XJDcoG/V6rW10K9HoGJWGcRyoHD5N6MlgcaiwE5EFzDZ+gQAAACSV6qI/lSV5Si6fZtDjHD6V60BjsCnAFr4BjYkp6vVNjYYmuV4pM3Adqf7duV0jeTxnOuxokLENBWV1ut9P11LpmLMbw78b18WbCTvRy1JZpCrJVUrFGidXRv/xQhrdfs+IbEDB62frHtxq7D/BJagHD3YWy/Pn5Ybfk+sGJLh07+8IOwscyK7g0c0V3WP5oVF82EGI5UaqtFGqGzbVHXYQCJBXQRzHf1lRPGGul6c/G3YWa5B/XV6apxzzpifsINaGvAr+/H7hjJV+XL5NXrO/GPdvynN/l0/6jwh2EKtCUgWPbqqIjRd6+tlF+9eSe1dldVWawa+6wQ5iPch4IZadUhcxgG+H/gEAImIdOQ7Ughty2EGsB+kUrH+sLcxRhPTu5Ncf7dBrmPOlIxLYKawH6RT8Pbmu/3gh7BQwodEpvYc7Xz9bBzuIlSCXguJSNZNNCYjohP1/z0XMKIG4VK3TGmEHsQbkUrDorkLgwbBadbm5uRqNBtbH24fFpZbkKgkqnFSQS8GSPKV/N6516kpOTp41a5ZKpYLy8Wfi352LFLQ29Y+1fAHN2d1KreALN2Cmbizi2j8TARFcWZ2O0CpIAokUlNXqMAwjouSysrJ58+bFxcWNGTNm3bp1RqMxOTl5/fr1AIDhw4dHRUUlJycDAHJychYuXBgXFxcXFzd37tyCggLTxxsaGqKiovbs2bNy5cq4uLi33nrL7MctC41OUTTolTK9xUsmGyS699AkN3D4hIyi+/zzz0tLS5cuXapUKm/dukWhUGJjY6dPn753795NmzbxeDwfHx8AQFVVlUajmTNnDoVCOXLkyOLFi5OTk1kslqmQ7du3v/rqq1u2bKFSqe7u7k9/3OJw+TSlXM91JNHviAhI9PWUcj1Bt+OqqqpCQ0MTEhIAANOnTwcACAQCkUgEAOjevbuTk5PpbaNHjx4zZozpcXh4+Lx583Jycvr27Wt6JiIiYsGCBc1lPv1xi8N1pCplBtCFoOLJAokUBACnMQk5EI8ZM2bnzp0bN26cM2eOQCBo620YhmVkZOzdu7ekpITD4QAA6ur+7pyLiYkhIls7MFlU3EjG26eWhUTngmwurVFKyKnPggULlixZkpaWNmHChMOHD7f1tm3bti1fvjw8PPybb7559913AQBG4989c2y2tW8YNtRqOXYwSoNECnL41Ca5gYiSMQxLSko6derUoEGDNm7cmJOT0/xS8ygNjUazY8eO+Pj4pUuXRkZGRkREdKRkQgd5EHdyTCpIpKCDgE4n5kBs6kDhcrnz5s0DANy/f7+5VZNIntyNValUGo0mLCzM9N+GhoZWrWArWn2cCBwENAenzt8KkugbunozKwtVigY9z9I/9w8++IDH4/Xt2zcrKwsAYPKsR48eVCr1q6++mjBhgkajmThxYlBQ0MGDB4VCoUKh+OWXXygUSmFhYVtlPv1xy2YuzVfSGRSMQsjfJKmgrlq1CnaGv2mQ6HRqo5sPy7LFVlRUZGVlnTt3TqVSLVq0aPDgwQAAPp/v7u5+/vz5K1euyOXycePG9erV6+rVq4cPHy4rK1u0aJGvr++xY8emTZum0+l2794dFxcXHh7eXObTH7ds5jsZDd5BbLcuFv5RkBByDVktv68szlUOnmRHAzbbIvmXqiGTXXlOnX+KJ4kOxAAAn1Du9bNScZnaw9f8X39DQ0N8fLzZl0QiUUVFxdPPDxo0aPXq1ZZO2po5c+aYPWqHhYU132VpSe/evb/++uu2Ssv9XcZzotmDf6RrBQEAlYWq6+fqEheanz9hMBhqamrMvoRh5r8Lm812dna2dMzWSCQSnc7MLd22UjGZTKGwzWGRv6wonvmpL5Pd+S+HyaggACDj8OOuPXmirhzYQeBw76pMqzb2Hkb4nw1JIFGnTDNDJrud2yVWKQjpIyQ55Q+aiu8q7Mc/kioIAJj6vs/+DeWwU1ibxnrd+b01/57vDTuIVSHjgdiERmXYt7582oc+dnJKVFOmTttbM22FD8UO+gJbQl4FTa3CgY2PJsz19OjsEzof3Jb/eVk2+b3OPirGHKRW0MTFAzUqpSF2vIvVBlRbk4qHTVeT60RB7NgJLrCzwMEGFAQAlOQqrybXBkRw3X1Y/t25neBQpVYaSvKU1SVqWa0udrzQ4jeEbAjbUNDEwzuND+8oSnKVYX34NAbG5dO4jlQmi2oTX4BKxZRyfZNcr5Dp5VJ9TZnavxs3uLeDT4id9j01Y0sKNlNaoJQ91inleqXMoNcbjRbtvdHpdPn5+T169LBkoQCweVTciHP4NJ4jTejJ8Ars5Ge3HccmFSSUurq6qVOnpqWlwQ5iL5C0XxBhPyAFEZBBCrYGw7Dg4GDYKewIpGBrcBz/66+/YKewI5CCrcEwzNHRThe/hwJSsDU4jstkMtgp7AikoBk8PDxgR7AjkIJmEIvFsCPYEUjB1mAY1nKmHIJokIKtwXE8Pz8fdgo7AimIgAxSsDUYhrWz+hbC4iAFW4PjuFQqhZ3CjkAKmsHFxU4HMEMBKWiG2tpa2BHsCKQgAjJIwdZgGBYYGAg7hR2BFGwNjuNFRUWwU9gRSEEEZJCCZmhe7hdhBZCCZjC7IiCCIJCCCMggBVuDRspYGaRga9BIGSuDFERABinYGjSJ08ogBVuDJnFaGaQgAjJIwdagecRWBinYGjSP2MogBVuDRspYGaRga9BIGSuDFERABiloBnd3d9gR7AikoBna2mkRQQRIQTOg8YLWBCloBjRe0JogBVuDBmtZGaRga9BgLSuDFDSDSGR+T3gEEaCtb54we/ZssVhMpVKNRmN9fb1AIMAwTK/Xp6SkwI7WyUGt4BMmT57c2NhYVVUlFos1Gk11dXVVVRWG2fx+i+QHKfiEUaNGBQQEtHwGx/HevXvDS2QvIAX/ZurUqRzO3/tienh4JCUlQU1kFyAF/2bUqFG+vr6mx6YmMDQ0FHaozg9S8B/MmDGDy+WamsCpU6fCjmMXIAX/wYgRI3x9fXEc79mzJ7pNZx1osAO8CEYD3iDRyep0RHQoxY+cC5pO/mvgzOJcpcULp1KBsxuDL6RbvGTbxfb6Be/flOdek6sVBg9/dpPcohuyEw/PmVZ+X+nsSo8eKUAbs5uwMQULrssL/1QOfNWDQrHhHjuN2pC2q3L4VDe3LizYWeBjS+eCD+80/pWjHDzF06b9AwAwWdTxc33O7aqpf6yFnQU+NqMgjuN3s2Sx/3aDHcRi9JvgdjOtHnYK+NiMgiqFof6xjsmmwg5iMRyF9EcPmmCngI/NKCiX6jvZmRObR2NzqXqtEXYQyNiMghgAqkY97BQWRlanQyMhbEZBRGcFKYiADFIQARmkIAIySEEEZJCCCMggBRGQQQoiIIMUREAGKYiADFIQARmkoAUQi6urxVWwU9gqSMGXpbKqImn6hAcP0EpILwhSEOA4XllV8cIfN+j1tjX5gWzY5Ay6DnLvXs6evdvu5eYAAEJDus2b925I8JN5mfkFuT/+9HVx8UOhwMXPP7Cw8MHunccZDIZard62/ceL6ee0Wk0Xke/kya8PHTISAHD02P70jLRXJ03bvv3HOmlt166hy5as9PHxqxZXzXxjEgBg9ZoPVwMwatS4D99fBft72xiduRUUi6s0Ws3r0+fMnPG2WFz14YrFarUaAFBTI162fD6NRvt4xRc9e0ZfvZo5YfwkBoNhNBo/XvnetWuXpyW98d67HwUFhXz+xUcpZ0+ZSisoyD18eM/SpSvXrP5K8rjmvxs+AwAIBS4ff/QFAOCNWfO+27RtetKbsL+07dGZW8Hhw0ePGDHG9DgkJHzJ0nn3cnOio/qev5CiUqk++2S9QCCMjR30590/sq9nJU2ddflK+t17dw7sS3ZxcQUADB/2L5Wq6djxA2NG/9tUyNovvhUIhACAxMTXfvr5W5lc5sh3DO4aCgDw8fGLiIiE+nVtlc6sIIZhV7IyDh/ZW1ZWYlqvqF5aBwCQSGq4XK5JJgzDvLxENTXVAIDs7Cy9Xp80fUJzCQaDgcvlNf+XxXoy89fd3RMAUFcrceSj3epels6s4O4923bs3DIxcerbcxbVSWtXr/nQiBsBAN7eXZRKZXFxYUBAkE6nKyx8EBkZBQCor68TCl2++WpLy0KoNDM/IjqNDgAwGG1sIj056bQK6nS6/Qd2jB0Tv3DBUgDA48d/byUyauS4I0f3fbTy3ZEjxub8eVuv18+a8TYAwMGB39BQ7+7uyWQyoWa3Lzrt5YhWq9VoNMH/fwkskzcAAIxGIwDA0dFp4YJlTCarpKQoqnffX7fuF4l8AAC9esUYDIbTyUebC1GpVM+siMlkmQ7KRH6bzkynbQW5XG5AQNDxEwcFAqFSodi1+xcKhVJcXAgAKLift/HL1YsXvk+j0ykUSnV1pUAgpFKpI4aPST5zfMvWzdXiquCuoYWFf2Vdzdj521EWq73Jo25u7l6e3oeP7mWx2XK5bMrk1ymUTvuHTQSdVkEAwCcfr9uwcdWaz1eIRD7z579XVPTXsWMH5r692MPd09PTe8OXq5u7lLsGhXy3eTuLxfpyw4+/bvs+PT31zJnjIpHPhPGTaObOBVuCYdjKles2frn6hx+/cnPzSIif0r6yiFbYzLJGNWXqS0clY+Z0sUhpBoOBSqWaHlzJyli95sOvv/q5V89oixTecfZ+UfT2ugAq3a6nEnfmVrAtystL//PeW/36DggKDNZoNZcvX2SxWCJvH9i57BR7VJDL5Q0b+q/s7CvnL6TweA4R3SPffXeFmxvaABYO9qigUOiycMFSU2cNAjro2g0BGaQgAjJIQQRkkIIIyCAFEZBBCiIggxREQAYpiIAMUhABGaQgAjI2oyCVBhwEnW33QFcRk0K162EytqSg0ItZfFcBO4UlkdZotGojZjO/AaKwmR8AhmHBvR3EpZ1nuyJJubprJK8Db+zk2IyCAIBhr7ldPlajVnaGeWul+Y3F9+TRowSwg8DHZkZNm9CoDHvWlkUOEfKc6M5uDJvKDgAAOADSanWjVFdWoJj8nujmzZsxMTGwQ0HGxhQ0cXb/g9L7jR7unrJancULx3FcrVaz2YTsV+3izQQA+ISwXxngBAAoKChYtmzZ8ePH7XraKG6DLFq0iLjCN23aFBcXd/r0aeKqaEl1dfWjR4/q6uqsUx0JsaVzQQBAeno6AOC7774jqPzq6uorV66oVKrDhw8TVEUrPDw8RCIRhmFTpkxRKDrVJX8HsSUFp0yZ4u3tTWgVR44cKS0tBQCUl5efOXOG0Lpa4uzsvHbt2tTUVKvVSB5sQ0GxWKxSqdauXRsSEkJcLZWVlZmZmabHSqXy0KFDxNX1NEFBQRMnTgQALFq0SKPRWLNquNiAgkeOHMnOzmaz2UFBQYRWdOLEibKysub/lpWVnTp1itAazTJ79uzffvvN+vXCwgYULCsri4+PJ7qWqqqqjIyMls8olcp9+/YRXe/TREZGzp8/HwDwww8/WL9260NqBX///XcAwLJly6xQ18GDB01NoGnpI9P9mEePHlmh6raIjo4eMGAAxABWAvYluXm0Wm3//v3r6+utX7VEIhk5cqT16zWLUqnEcfzevXuwgxAIGVvBhoaGsrKyixcvOjk5Wb92g8EQGhpq/XrNYlocFsfxt956C3YWoiCdgqdPny4tLQ0KCoK1PpVOpzP1y5CHiIiI+fPnV1RUdMqOQ3IpKJFI7ty5ExkJc91wlUrl7k669WV69eolEokqKyuhXCERCokULC0txTDss88+gxujrq6OTifp2NiQkJCampo//vgDdhBLQhYFP/30Uzab7eLiAjsIqK+v9/Eh70JvS5YscXd3VyqVsINYDFIoWFFR0adPH5Ic/kpKSsjwl9AO3t7ebDY7KipKLpfDzmIB4CuoUql4PN7YsWNhB3mCRqMJDAyEneIZUCiUmzdvXrhwobkX03aBrODy5cuvXbsGpfOlLdLT04ODg2GneDYYhiUmJhqNRlsf3ABzicvbt28vXry4SxfLLB9tERoaGvh8vpeXF+wgHYVGo2VmZgYGBhJ9A504oLWCUqm0a9eupPIPAJCdne3n5wc7xfOxbt26hoYG2CleHDgKHj16dOvWrXw+H0rt7XD58uWBAwfCTvHcREVFZWRk2GhnDQQFxWKxk5PTihUrrF/1M5HJZLaoIABgyJAhly5dSklJgR3kubHJ6UsEkZqampmZuW7dOthB7Atrt4ILFy7Mzc21cqUd5MSJEwkJCbBTvCz79++XSGxpQzyrKpiZmTl+/Pju3btbs9IOUlJSQqPRoqOtvQGTxUlKSho/frwNHdzQgfgJy5YtGzt27JAhQ2AHsTus1woeOnSItIfg+/fvV1dXdyb/CgoKbOUC2UoKlpaWHj58mJyHYADAt99+a53pAVYjLCxs8+bNpP2bb4mVFMQwbNu2bdap63k5efKkSCTq2bMn7CAWZuvWrTZxB9nezwX1ev2oUaMuXrwIO4j9Yo1WMD09fc2aNVao6AVYsmQJabO9PE1NTcOHD4ed4hlYQ8Hs7Ox+/fpZoaLnZc+ePQEBAbGxsbCDEAWHw5k5c+bZs2dhB2kP+z0QP3z48PvvvyduhSREB7GGglqtlsFgEF3L8xITE3Pt2jUqlQo7COFkZWX5+fmJRCLYQcxD+IE4Ly9vzpw5RNfyvEyfPn3Xrl324J+pCdi8eTPsFG1CuIIKhYJsoyl/+OGHadOmhYWFwQ5iJYYOHerj42MwkHSNbrs7F9y2bZtOpzOtG4QgA4S3gnq9XqvVEl1LBzl9+nRlZaUd+ldQUHDp0iXYKcxDuILp6enQZ6ebuHnzZl5eHknCWBk2m/3999/DTmEewqcvCYVCMtwmunv37k8//bRjxw7YQeDg5+f39ttvk7Nrwi7OBYuKilasWGG1FcwRz4U17o7APResqKhYvnw58u/s2bM3btyAncIM1lAwISFBLBZboaKnefjw4TvvvHP8+HEotZMKqVSalZUFO4UZrDGVffDgwTNnzjQYDHK53M3NzWqbKdy/f//gwYOnT5+2TnUkZ8iQIS0XcycPBCo4cODApqYm0yKhGIaZHoSHhxNXY0uKioo+/vjjY8eOWac68uPl5UXOVSIIPBAPHTqUQqGYxquanmEymX369CGuxmZyc3N//fVX5F9Lamtr169fDzuFGQhUcNWqVeHh4S2vuF1dXXv06EFcjSZycnK+/PJLcv64IYLjODl7p4m9HNmwYUPzEi04jnM4HKLvF1+5cuXMmTO7du0itBZbxMnJiYTjRQhX0N3d/b333jOtGIlhGNFNYGpq6rFjx1auXEloLTYKnU6fNGkS7BRmILxTJi4uLjExkcvl8ng8Qk8ET548mZmZuWnTJuKqsGl0Ot2GDRtgpzBDh66I9TqjSvHiN9mmvvpmWdHjoqKiAJ9ujfX6Fy6nHTIyMvLuFaPlYNrHtJsV2XjGDbqCG/K7V2RSsZbNe6nRnc39MgSh1WrdvHlVRU0Br/CiRzgLvex4k/N/snz58osXLzZ3ipnOiHAcJ89E9/ZawRtp0toq3YBEDwcBSTdBaIXRgDdItCk7xcOT3D394OycQzbmz5+fn59fU1PTsneMVMt4tnkueP2cVCbRD0hwtxX/AAAUKibwYMYv8L144HFNuRp2HFIQEBDQu3fvlsc6DMNItYaieQXrH2trKzV9x7lZPY9lGDrV81ZaPewUZGHGjBktN9QQiUSvvfYa1ET/wLyCtZUaHCfw1I1oHJzpjx42aTXwxymSgaCgoJiYGNNjHMcHDBhAki1eTJhXUCEzuHax7XMp33CutFoDOwVZeP31193c3Ezb5kybNg12nH9gXkGdxqhT23YTIq/TA2DDDbllCQwM7NOnD47jgwYNIlUTCHnfEURbGI14+f0mRb1eKdfrdbhKaYH5lz28pqt7dg0RxF44UPPypbHYVAabwuFT+c50n1DOyxSFFCQXBTfkD24rKh42eQXz9VqcSqdS6DSAWaJTgsKK6TdWZwS6JgsU1qjADTq9Qa+j0zWnt1b5hnODe/JCohxeoCikIFnIvy7POlXr6uNA4zp0H0GuY2X7OPsKGh835d1WX02uGxAv7Nrz+URECsJHpTCk7KjRGSgBfUQ0hu2tMYJhGN+dCwCX58q/lS4tuKkYO9uDSu3oiTj8nTjtnPIHyt1ry3jeAo8QV1v0ryUMNs0z3I3h7LTl/aLHjzp6awApCJOaR+rM49KQgb5Mts3cgnomLB6j23D/lB018roOzZxECkKjJE+RtlfSJZKM8zleHr9o0fGfxOKyZ7eFSEE4KBr0Fw90Wv9M+EV5H/++Uq97RgczUhAO53bX+MV4w05BOIF9vf732zO6IZGCELh1vt4AGDS6bV98dAQml6FUYnnXZO28BykIgeyUOrcgZ9gprIRbgOBqsrSdN1hSwfyCXI3mpUYGXMq8MGRYVHl5qeVCkY7bF6Te4QJCx5C/MGs2jjt6ysKTX2lMqtDHIff3NhtCiyl4LjV5wcJZarXKUgV2VgpuKliOtj0K6Xlh8lj3bynaetViCr5k+2cnyKU6tdLIdrCvqS08IVvySK1rY/imZW7QnUtN3rR5PQAgPnE4AOCD9z/716jxAIC0tP/tO7CjqqpCKHQZOyZhWtIbpiU+9Hr9jp1bUtPOyGQNvr7+s2bOjYsd/HSx2dlZv2z7vqqqwsPDa8L4SYkJUyySFiKPHjQ5i3gEFV5YfDvl/E9V4r8ceIIg/6jRI+bzHVwAACvXDps4/oPcgkv5D66yWby+0QkjhzyZ024wGC5c2p5966RWqwoM6K3TETXbwcXPoaygKSjSzHe3TCvYJyZ28qvTAQD/Xbvpu03b+sTEAgBSU8/8d8NnXbuGfrJy3eBBI37b8fO+/U8WOf3q6y8OHd4zbmzCxx994eHh9cmny+7evdOqzKamplVrPmDQGUuXrOzfb2BdnS3tNN4WtdU6HCfkEvBh0c1fdy92d/OfHP/xwP5JxaV3tuxYoNU+Uerg8dVeHsHvzN7Sq8fotPRf8x9cNT1/4syX5y9tDw3unzBuGYPOUqkbicgGADAYsHqJ+ZsllmkFnZ0FXl4iAEBYWHdHRyfTAPFtv/0YERG58qMvAAADBwxtbJQfPLRrYuLU2trHqWlnZrw+Z9bMuQCAQQOHTZ+RsHPX1m++3tKyzPoGqUajGTBg6Ijhoy0SkgwoZXoak01EySf/93XfqISEcU+2tA0O6vPld1MeFGZHhA8GAMT0mjBs0CwAgJdH8I3bp/4qzA4Pia2oup9968SwQW+MHj4PABDVc2xRCVEzO+lMmqKNKeREjZSpqCivrZVMmfx68zPR0f1Szp6qqCx/8CAfABAX92T/aQzDoqP6nr+Q0qoEL0/vbt1e2btvO4vFHj8ukYSLJL8AKoWB6Wz57kBpfXWNpKRW+ij71smWzzfInnQLMxhPvKdSqY58N5lcAgC4l38JADCw/9Tm92MYUZ10NCalSW5dBRVKBQDAyUnQ/IyDAx8AUCt5rFQqAADOLV7i8x2bmpqUSmXLEjAMW7/uu23bf9iyddORo3tXfLCmR49eBKW1GgQt7N2oqAMAjBgy55Xwf2ws7+Dg8vSbKRSa0WgAADQ0iFksHpfjSEimVuCYsY3vbmHrm+erurm6AwBksobml+rrpSYRXVzcAABy+d8dRVJpHY1GY7Fad1XweLx3//Phrp3HuFzeyk+WmBbMtGm4jlS9xvK7ILFZDgAAnU7j5urX8h+b1d6lD5frrFYrdHprrASu1+gdnM23dxZTkM1iAwBqa59cNAiFLh7unjduXG1+Q2bmBRaLFRQUEhbWHcOw7OtP1j3WarXZ17O6dXuFSqUy6IyWdpo6erw8vRMTXlMoFWJxlaXSwsLBkabXWl5BVxcfJ0ePm38ka7RP+mUNBr1er2v/UyLvUADAnbupFs/zNHqtwcHJvILUVatWPf1sZZHKoAcefs9x4sxic06dPlJaVowBLL/gXkhIuAOPf+jIXomkRqfTHT9x8MLFs9OS3oyO6st34IvF1SdOHgIAq62V/PzztyWlRcuXferp6U2j00+cPHT/QZ6Pj5+L0HXGrMTaWkldXe2Jk4e0Gs3sN9+h0Tp65vDwjtwvjMNr42vDQiHT1Yn1bCcLX5FgGObs5Hnj9un8+1dwgJc9unfizNcGg9a3SwQAIP3KbpFXaEjQk2XNsm+eZLG4PV8Z6ebifzfv4u07KSq1QqGsv3bzRFHJLZFXWHhonGXjAQDUMqV/OEvgbuaE3mIK8h34rq7uly6dv3btSmOjfNSocUFBwc7OgvSMtLPnTjfUS5OS3pg+7U3TjanoqH5KpeLsuVPp6alcDnfZ0pXR0f0AAA48B08Prz/u3KRglLDwiIqK8qyrGVey0oVC1w/fX+Xt/RzbmZJTQQ6fduN/tUJfy59+ubv6ibzDi0tzbueklFfkeXoG9Y4cbeoXbEtBCoUSFhwnqS27m3exuDTHwy1AWl/l7upPhIIlt2uGT3OnUMzcljS/staNVKlWDXoMFjz9kq2Qsr1iUKKLB/kWN9q/8ZGTj5DjaEc3SBprm/TyxoQF5gdHkquRsAfC+/IK81TtKPhX4Y3dh1Y8/Tyb5dBW1/G4UYv6RsVbKmHBg6v7jn769PM4jgOAm+24mffGjyKv0LYK1Cg03WK4bb2KFLQ2kQOdr50pchbxqTTz14J+Pq8seWfP089exkGDAAACdklEQVTjOGhreA2Hbckje6B/b7MBjEYjjuNm9xHnO7i2VZpWpZOLFWHRbS4nhxSEQOx4Yf5tqUeImU47AACDwRIwYA7ot2yA2uL6AfHCdt6AhqxC4JUBTmyWQaN6RqdJJ0DdqHESYu1PbkcKwmH0Gx7F2ZWwUxCL0YgX36ga84ZH+29DCsKBwaTEz/cqudGZLSzOrpj6vs8z34YUhIanPztxoUfJjQrYQSyPQW98eLU86QORs9uzB5cgBWHiKGSMn+ORm1aikneelbGV9eqHWeVTlog4vA5d7CIFIePizVzwTaBRIa/MrdEoYe4d/vKo5JpHf1bTjYp5GwL5HV4lH3XKwAfDsLGzPUtylZdPPOY4sWgcJt+VQ7WdWcZ6jUEuURo0Wp1SMzjRpUvw8614iRQkC/7duf7duUX3FA/vKAuvSgUijk5jpDJoNCaNhCsW4zhu0OgNOj2dQakXq/y7c7vG8vzCX2RZRKQguQiM4AVG8AAA1SUqpcyglOm1GqPaEgv9WhYmh8LiMDh8joMz1d3nGd0u7YMUJCme/oRMMSEh5hVksDAj+Rr/58LRlU7YRAiEJTH/W3JwpkvKbHtdhJK7CqFnZ5jx1Okxr6BbFyYp1zzpKA0SrV83Do2OmkEboM1W0DuIdfmY2Op5LMPFfVV9x7Q3OgNBHtrbjzjvmuxhjqLHIKGzO6OtwW2kQqXQy2p1l4+KJy7ydurArSEEGXjGltglecqczAZxiZpKI/uBWeDJlEm0Ad05MaOFXD660rcZnqFgMxoV2bekw3HA4thAU41oRUcVRCAIAjUbCMggBRGQQQoiIIMUREAGKYiADFIQAZn/A2s7oJwX4YOFAAAAAElFTkSuQmCC",
      "text/plain": [
       "<IPython.core.display.Image object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import Image, display\n",
    "\n",
    "display(Image(agent_executor.get_graph().draw_mermaid_png()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd2275d7-cdc2-439d-a78e-5b0c08b49a12",
   "metadata": {},
   "source": [
    "Now, we are ready to proceed with evaluating our agent!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9aa84e32-ac33-411d-a075-a32258ed3926",
   "metadata": {},
   "source": [
    "## Agent Evaluation with ragas library \n",
    "\n",
    "In this section, we will explore sophisticated methods for evaluating agentic systems using the ragas library. Building upon our previous work with the vacation destination agent from Lab 3, we'll implement both high-level (macro) and low-level (micro) evaluation approaches.\n",
    "\n",
    "ragas provides specialized tools for evaluating Large Language Model (LLM) applications, with particular emphasis on agentic systems. We'll focus on two key evaluation dimensions:\n",
    "\n",
    "1. [High-Level Agent Accuracy](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/agents/#agent-goal-accuracy):\n",
    "   - Agent Goal Accuracy (with reference): Measures how well the agent achieves specified goals by comparing outcomes against annotated reference responses\n",
    "   - Agent Goal Accuracy (without reference): Evaluates goal achievement by inferring desired outcomes from user interactions\n",
    "\n",
    "2. [Low-Level Tool Usage](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/agents/#tool-call-accuracy):\n",
    "   - Tool Call Accuracy: Assesses the agent's ability to identify and utilize appropriate tools by comparing actual tool calls against reference tool calls\n",
    "   - The metric ranges from 0 to 1, with higher values indicating better performance\n",
    "\n",
    "This structured evaluation approach allows us to comprehensively assess our vacation destination agent's performance, both at the system level and the component level. By maintaining our focus on the singleton agent from Lab 3, we can clearly demonstrate these evaluation techniques without the added complexity of nested agent systems.\n",
    "\n",
    "Let's proceed with implementing these evaluation methods to analyze our agent's effectiveness in handling vacation-related queries and tool interactions."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "daf00b75-1051-47ec-80d1-f3d65ee8fdc6",
   "metadata": {},
   "source": [
    "### Util functions part 2 - Message Format Conversion\n",
    "\n",
    "Our singleton agent is built using the LangChain/LangGraph framework. LangChain defines several [message objects](https://python.langchain.com/v0.1/docs/modules/model_io/chat/message_types/) to handle different types of communication within an agentic system. According to the LangChain documentation, these include:\n",
    "\n",
    "- HumanMessage: This represents a message from the user. Generally consists only of content.\n",
    "- AIMessage: This represents a message from the model. This may have additional_kwargs in it - for example tool_calls if using Amazon Bedrock tool calling.\n",
    "- ToolMessage: This represents the result of a tool call. In addition to role and content, this message has a `tool_call_id` parameter which conveys the id of the call to the tool that was called to produce this result.\n",
    "\n",
    "Similarly, the ragas library implements its own message wrapper objects:\n",
    "\n",
    "- [HumanMessage](https://docs.ragas.io/en/latest/references/evaluation_schema/?h=aimessage#ragas.messages.HumanMessage): Represents a message from a human user.\n",
    "- [AIMessage](https://docs.ragas.io/en/latest/references/evaluation_schema/?h=aimessage#ragas.messages.AIMessage): Represents a message from an AI.\n",
    "- [ToolMessage](https://docs.ragas.io/en/latest/references/evaluation_schema/?h=aimessage#ragas.messages.ToolMessage): Represents a message from a tool.\n",
    "- [ToolCall](https://docs.ragas.io/en/latest/references/evaluation_schema/?h=aimessage#ragas.messages.ToolCall):  Represents a tool invocation with name and arguments (typically contained within an `AIMessage` when tool calling is used)\n",
    "\n",
    "To evaluate the conversation flow generated by the LangGraph agent, we need to convert between these two message type systems. For convenience, we've implemented the `convert_message_langchian_to_ragas` function in the `utils.py` module. This function handles the conversion process seamlessly. You can import it along with the message wrapper objects, which are assigned appropriate aliases to ensure cross-compatibility between the frameworks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "606a9191-9677-4174-8d15-d20efaaeea1c",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:49:51.179061Z",
     "start_time": "2025-02-14T16:49:51.176916Z"
    }
   },
   "outputs": [],
   "source": [
    "from utils import convert_message_langchain_to_ragas"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16c467d0-c358-4baa-ab70-a00cf6e01e8f",
   "metadata": {},
   "source": [
    "Since we typically evaluate multi-turn conversations, we will implement a helper function capable of handling arrays of messages. This function will enable us to process and analyze multiple conversational exchanges seamlessly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "304cd23e-1298-466b-8f8a-71776c40b414",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:49:52.813770Z",
     "start_time": "2025-02-14T16:49:52.811438Z"
    }
   },
   "outputs": [],
   "source": [
    "def convert_messages(response):\n",
    "    return list(map((lambda m: convert_message_langchain_to_ragas(m)), response['messages']))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39e97c81-3dbb-40fc-900d-d1fd1cd5fa7f",
   "metadata": {},
   "source": [
    "For the evaluation, we will examine two distinct scenarios that represent different user profiles:\n",
    "\n",
    "1. **Andrew Macdonald - The User with Travel History**\n",
    "   * A 62-year-old resident of Paris\n",
    "   * Exists in our travel history database and is logged in with user_id 918\n",
    "   * Previous travel records enable precise, personalized recommendations\n",
    "   * Expected to have a smooth, seamless conversation flow\n",
    "\n",
    "2. **Jane Doe - The First-Time User**\n",
    "   * No prior interaction with our travel recommendation system\n",
    "   * Requires the agent to rely on creative recommendation strategies\n",
    "   * Will test the system's ability to gather information and provide relevant suggestions\n",
    "   * May experience a slightly more exploratory conversation flow\n",
    "\n",
    "These scenarios will help us evaluate our travel agent's performance across different user types and interaction patterns. Let's proceed with executing these conversation flows to assess our system's effectiveness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "d41e1309-3fa4-4a1c-8f9f-3e05c05cd385",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:03.411945Z",
     "start_time": "2025-02-14T16:49:56.962578Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Oops, looks like there was an issue with the travel guide lookup. Let me try a different approach.\n",
       "\n",
       "Ljubljana is the capital and largest city of Slovenia. It's known for its well-preserved Baroque and Renaissance architecture, as well as its lively cultural scene. Some highlights include:\n",
       "\n",
       "- The Triple Bridge and Ljubljana Castle, which offer great views over the city\n",
       "- The Ljubljana Old Town, with its charming cobblestone streets and cafes\n",
       "- The Tivoli Park, a large green space perfect for strolling and relaxing\n",
       "- Museums like the National Museum of Slovenia and the Museum of Modern Art\n",
       "\n",
       "Ljubljana has a mild, continental climate and is a very walkable, pedestrian-friendly city. It's a great destination for those interested in architecture, culture, and outdoor activities. Does this sound like a good fit for your vacation interests? Let me know if you need any other details!"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langchain_core.messages import HumanMessage\n",
    "\n",
    "config = {\"configurable\": {\"user_id\": 918}}\n",
    "response_andrew = agent_executor.invoke(\n",
    "        {\"messages\": [HumanMessage(content=\"Suggest me a good vacation destination.\")]},\n",
    "        config,\n",
    "    )\n",
    "display_markdown(Markdown(response_andrew['messages'][-1].content))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f17dad47-102f-45d0-9145-fbe16de2bfc8",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:09.283102Z",
     "start_time": "2025-02-14T16:50:03.436768Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Hmm, it looks like there was an issue with the travel guide tool. Let me try a different approach.\n",
       "\n",
       "Some top beach destinations that are popular for vacations include:\n",
       "\n",
       "- Hawaii - With beautiful sandy beaches, crystal clear waters, and lush tropical landscapes, Hawaii is a classic beach destination.\n",
       "- Bali, Indonesia - Known for its stunning beaches, vibrant culture, and amazing surfing, Bali is a top pick for beach lovers.\n",
       "- Santorini, Greece - This picturesque Greek island has stunning white-washed buildings and dramatic cliffs overlooking the Aegean Sea.\n",
       "- Cancun, Mexico - Cancun offers gorgeous Caribbean beaches, lively nightlife, and easy access from North America.\n",
       "- Bora Bora, French Polynesia - This remote South Pacific island is famous for its overwater bungalows and crystal clear turquoise waters.\n",
       "\n",
       "Let me know if any of those appeal to you or if you have a specific region in mind and I can try to provide a more tailored recommendation!"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langchain_core.messages import HumanMessage\n",
    "\n",
    "config = {\"configurable\": {}}\n",
    "response_jane = agent_executor.invoke(\n",
    "        {\"messages\": [HumanMessage(content=\"Suggest me a good vacation destination. I love beaches!\")]},\n",
    "        config,\n",
    "    )\n",
    "display_markdown(Markdown(response_jane['messages'][-1].content))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b086dc3-5203-4531-beb5-5566a2b13815",
   "metadata": {},
   "source": [
    "Now that we have collected the agent conversations from Andrew and Jane we can proceed with converting them from LangChain's message format into ragas message format. For this conversion, we will utilize the previously defined `convert_messages` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "68447b7a-85e9-461f-a0db-742050547734",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:16.104819Z",
     "start_time": "2025-02-14T16:50:16.097224Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Oops, looks like there was an issue with the travel guide lookup. Let me try a different approach.\n",
       "\n",
       "Ljubljana is the capital and largest city of Slovenia. It's known for its well-preserved Baroque and Renaissance architecture, as well as its lively cultural scene. Some highlights include:\n",
       "\n",
       "- The Triple Bridge and Ljubljana Castle, which offer great views over the city\n",
       "- The Ljubljana Old Town, with its charming cobblestone streets and cafes\n",
       "- The Tivoli Park, a large green space perfect for strolling and relaxing\n",
       "- Museums like the National Museum of Slovenia and the Museum of Modern Art\n",
       "\n",
       "Ljubljana has a mild, continental climate and is a very walkable, pedestrian-friendly city. It's a great destination for those interested in architecture, culture, and outdoor activities. Does this sound like a good fit for your vacation interests? Let me know if you need any other details!"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "rg_messages_andrew = convert_messages(response_andrew)\n",
    "display_markdown(Markdown(rg_messages_andrew[-1].content))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "cb642fba-8d53-4ff2-b1b6-e7f421733adb",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:18.759815Z",
     "start_time": "2025-02-14T16:50:18.755683Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "Hmm, it looks like there was an issue with the travel guide tool. Let me try a different approach.\n",
       "\n",
       "Some top beach destinations that are popular for vacations include:\n",
       "\n",
       "- Hawaii - With beautiful sandy beaches, crystal clear waters, and lush tropical landscapes, Hawaii is a classic beach destination.\n",
       "- Bali, Indonesia - Known for its stunning beaches, vibrant culture, and amazing surfing, Bali is a top pick for beach lovers.\n",
       "- Santorini, Greece - This picturesque Greek island has stunning white-washed buildings and dramatic cliffs overlooking the Aegean Sea.\n",
       "- Cancun, Mexico - Cancun offers gorgeous Caribbean beaches, lively nightlife, and easy access from North America.\n",
       "- Bora Bora, French Polynesia - This remote South Pacific island is famous for its overwater bungalows and crystal clear turquoise waters.\n",
       "\n",
       "Let me know if any of those appeal to you or if you have a specific region in mind and I can try to provide a more tailored recommendation!"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "rg_messages_jane = convert_messages(response_jane)\n",
    "display_markdown(Markdown(rg_messages_jane[-1].content))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1ae3c79-521f-4962-9e4e-9a87e44a51f3",
   "metadata": {},
   "source": [
    "With our conversation flows now properly formatted, we can proceed with the actual evaluation phase."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf995770-f6f7-4f86-909f-b60328ddfeaf",
   "metadata": {},
   "source": [
    "### Agent Goal Accuracy\n",
    "\n",
    "Agent Goal Accuracy is a metric designed to evaluate how well an LLM identifies and achieves user goals. It's a binary metric where 1 indicates successful goal achievement and 0 indicates failure. The evaluation is performed using an evaluator LLM, which needs to be defined and configured before metric calculation.\n",
    "\n",
    "The Agent Goal Accuracy metric comes in two distinct variants:\n",
    "\n",
    "- Agent Goal Accuracy without reference\n",
    "- Agent Goal Accuracy with reference\n",
    "\n",
    "Before exploring these variants in detail, we need to establish our evaluator LLM. For this purpose, we will utilize Amazon Nova Pro as our judge. While this is our choice in this lab, the selection of the evaluator LLM always needs to be adjusted based on specific use cases and requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3f6786f5-4c5f-4ca4-a5da-30a275fbbd5c",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:21.705568Z",
     "start_time": "2025-02-14T16:50:21.695321Z"
    }
   },
   "outputs": [],
   "source": [
    "import boto3\n",
    "from ragas.llms import LangchainLLMWrapper\n",
    "from langchain_aws import ChatBedrockConverse\n",
    "\n",
    "bedrock_client = boto3.client(\"bedrock-runtime\", region_name=\"us-west-2\")\n",
    "\n",
    "judge_llm = LangchainLLMWrapper(ChatBedrockConverse(\n",
    "    model=\"us.amazon.nova-pro-v1:0\",\n",
    "    temperature=0,\n",
    "    max_tokens=None,\n",
    "    client=bedrock_client,\n",
    "))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2cff6e40-26c3-4ccb-b32b-65d2def73410",
   "metadata": {},
   "source": [
    "#### Agent Goal Accuracy Without Reference\n",
    "AgentGoalAccuracyWithoutReference operates without a predefined reference point. Instead, it evaluates the LLM's performance by inferring the desired outcome from the human interactions within the workflow. This approach is particularly useful when explicit reference outcomes are not available or when the success criteria can be determined from the conversation context.\n",
    "\n",
    "To evaluate this metric, we first encapsulate our agent conversation in a `MultiTurnSample` object, which is designed to handle multi-turn agentic conversations within the ragas ecosystem. Next, we initialize an `AgentGoalAccuracyWithoutReference` object to implement our evaluation metric. Finally, we configure the judge LLM and execute the evaluation across our three agent conversations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b8a10ce3-e3bf-486d-bb6d-17e2a0638562",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:22.991978Z",
     "start_time": "2025-02-14T16:50:22.989043Z"
    }
   },
   "outputs": [],
   "source": [
    "from ragas.dataset_schema import  MultiTurnSample\n",
    "from ragas.metrics import AgentGoalAccuracyWithoutReference\n",
    "\n",
    "\n",
    "sample_andrew = MultiTurnSample(user_input=rg_messages_andrew)\n",
    "\n",
    "sample_jane = MultiTurnSample(user_input=rg_messages_jane)\n",
    "\n",
    "scorer = AgentGoalAccuracyWithoutReference(llm=judge_llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "95db55ba-e93c-4c2d-8a0d-c365b6222983",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:26.671901Z",
     "start_time": "2025-02-14T16:50:23.617114Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "await scorer.multi_turn_ascore(sample_andrew)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "26e411bc-0934-4223-bbbf-7089163b0e3d",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:29.252659Z",
     "start_time": "2025-02-14T16:50:26.756647Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "await scorer.multi_turn_ascore(sample_jane)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e05a484c-f0ef-49a6-a43c-f0c387322867",
   "metadata": {},
   "source": [
    "#### Agent Goal Accuracy With Reference\n",
    "AgentGoalAccuracyWithReference requires two key inputs: the user_input and a reference outcome. This variant evaluates the LLM's performance by comparing its achieved outcome against an annotated reference that serves as the ideal outcome. The metric is calculated at the end of the workflow by assessing how closely the LLM's result matches the predefined reference outcome.\n",
    "\n",
    "To evaluate this metric, we will follow a similar approach. First, we encapsulate our agent conversation within a `MultiTurnSample` object, which is specifically designed to manage multi-turn agent conversations in the ragas library. For this evaluation, we need to provide an annotated reference that will serve as a benchmark for the judge's assessment. We then initialize an `AgentGoalAccuracyWithReference` object to implement our evaluation metric. Thereby, we set up the judge LLM as evaluator llm. Then we conduct the evaluation across all three agent conversations to measure their performance against our defined criteria."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "f09bf878-43d0-4aea-9e27-4ec16af602e2",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:56.688501Z",
     "start_time": "2025-02-14T16:50:56.683838Z"
    }
   },
   "outputs": [],
   "source": [
    "from ragas.dataset_schema import  MultiTurnSample\n",
    "from ragas.metrics import AgentGoalAccuracyWithReference\n",
    "\n",
    "\n",
    "sample_andrew = MultiTurnSample(user_input=rg_messages_andrew,\n",
    "    reference=\"Provide detailed information about suggested holiday destination.\")\n",
    "\n",
    "sample_jane = MultiTurnSample(user_input=rg_messages_jane,\n",
    "    reference=\"Provide detailed information about suggested holiday destination.\")\n",
    "\n",
    "scorer = AgentGoalAccuracyWithReference(llm=judge_llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "eebac6f9-3443-473a-af4c-3e33588258a5",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:50:59.579580Z",
     "start_time": "2025-02-14T16:50:57.180684Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "await scorer.multi_turn_ascore(sample_andrew)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "1fde20da-8227-48d9-8dec-8de757a786bf",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-14T16:51:25.324286Z",
     "start_time": "2025-02-14T16:50:59.612897Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "await scorer.multi_turn_ascore(sample_jane)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc2e40f9-cdeb-4a62-86c9-38367c284a1a",
   "metadata": {},
   "source": [
    "Let's analyze the results and their relationship to each persona's interaction patterns with the agent. We encourage you to discuss these findings with your workshop group to gain deeper insights. Keep in mind that agent conversations are dynamic and non-deterministic, meaning evaluation results may vary across different runs. \n",
    "\n",
    "However, certain patterns emerge:\n",
    "- Andrew's conversations typically achieve a 1.0 rating due to their focused and goal-oriented approach\n",
    "- Jane's conversations are typically rated with 0.0. Due to the lack of historic information the system can't provide suggestions in one single conversation turn. A human in the loop approach asking for her interests could solve this.\n",
    "\n",
    "Note, that especially for `AgentGoalAccuracyWithReference` you could influence the results by adjusting either the conversation flow or the reference. If you have time left, feel free to try it out!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9e37b94-2166-4f4c-9bca-bc68a16d1fa0",
   "metadata": {},
   "source": [
    "### Tool Call accuracy\n",
    "\n",
    "ToolCallAccuracy is a metric that can be used to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. This metric needs user_input and reference_tool_calls to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. The metric is computed by comparing the reference_tool_calls with the Tool calls made by the AI. Therefore, in this particular scenario, there is no need for an evaluator LLM. The values range between 0 and 1, with higher values indicating better performance.\n",
    "\n",
    "To evaluate the tool call accuracy metric, we follow a different process. First, we encapsulate our agent conversation within a `MultiTurnSample` object, which is specifically designed to handle multi-turn agent conversations in the ragas library. This evaluation requires a set of annotated reference tool calls that serve as a benchmark for assessment. Next, we initialize a `ToolCallAccuracy` object to implement our evaluation metric. While the default behavior compares tool names and arguments using exact string matching, this may not always be optimal, particularly when dealing with natural language arguments. To mitigate this, ragas provides us with a choice of different NLP distance metrics we can employ to determine the relevance of retrieved contexts more effectively. In this lab we use `NonLLMStringSimilarity`, which is leveraging traditional string distance measures such as Levenshtein, Hamming, and Jaro. Therefore, we set the parameter `arg_comparison_metric` to `NonLLMStringSimilarity`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "f158250b-ca76-46e9-a60c-c621e1215f34",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-13T19:06:17.742495Z",
     "start_time": "2025-02-13T19:06:17.719709Z"
    }
   },
   "outputs": [],
   "source": [
    "from ragas.metrics import ToolCallAccuracy\n",
    "from ragas.dataset_schema import  MultiTurnSample\n",
    "from ragas.messages import ToolCall\n",
    "from ragas.metrics._string import NonLLMStringSimilarity\n",
    "\n",
    "\n",
    "sample_andrew = MultiTurnSample(\n",
    "    user_input=rg_messages_andrew,\n",
    "    reference_tool_calls=[\n",
    "        ToolCall(name=\"compare_and_recommend_destination\", args={}),\n",
    "        ToolCall(name=\"travel_guide\", args={\"query\": \"Ljubljana\"}),\n",
    "    ]\n",
    ")\n",
    "\n",
    "sample_jane = MultiTurnSample(\n",
    "    user_input=rg_messages_jane,\n",
    "    reference_tool_calls=[\n",
    "        ToolCall(name=\"compare_and_recommend_destination\", args={}),\n",
    "        ToolCall(name=\"travel_guide\", args={\"query\": \"Miami, Florida\"}),\n",
    "    ]\n",
    ")\n",
    "\n",
    "scorer = ToolCallAccuracy()\n",
    "scorer.arg_comparison_metric = NonLLMStringSimilarity()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "df1a688a-4d99-4b24-b487-5918c62414a4",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-13T19:06:18.623473Z",
     "start_time": "2025-02-13T19:06:18.618893Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "await scorer.multi_turn_ascore(sample_andrew)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "834ba036-9c7e-44ce-b9c0-9e11c793a562",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-02-13T19:06:18.995309Z",
     "start_time": "2025-02-13T19:06:18.992153Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5714285714285714"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "await scorer.multi_turn_ascore(sample_jane)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28be7ce6-c419-4135-80c9-f3e974ca97c1",
   "metadata": {},
   "source": [
    "Let's analyze the results and their relationship to each persona's interaction patterns with the agent. We encourage you to discuss these findings with your workshop group to gain deeper insights. Keep in mind that agent conversations are dynamic and non-deterministic, meaning evaluation results may vary across different runs.\n",
    "\n",
    "However, certain patterns emerge:\n",
    "- Andrew's conversations typically achieve a very high rating due to his focused and goal-oriented approach and the fact that he can be found in the travel database - this helps matching all tool call arguments\n",
    "- Jane's conversations typically achieve a high but slightly lower rating. While her conversation is focused and goal-oriented, she is not in the travel database. This causes the tool call arguments to be less deterministic, reducing the likelihood of a specific city as recommendation. If you you have time, try modifying the reference `query` argument of the `travel_guide` tool call, e.g. to \"beach destination\". Try to correlate the result with the tool calls in the message history. What do you observe? "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
